Abstract. Computational tools and resources play an important role in vocabulary
acquisition. Although a large variety of dictionaries and learning games are available,
few resources provide information about the complexity of a word, either for learning
or for comprehension. The idea here is to use frequency counts combined with
intra-lexical variables to account for the difficulty of a word. By using such predictors, we
have built a lexical resource for French, ReSyf, where words are available with their
word senses and their synonyms according to a complexity level. In this paper, we
present the methodology used to build the resource and discuss possible
applications, from enhancing vocabulary acquisition to text simplification.

1. Introduction

Computational tools and resources play an important role in vocabulary
acquisition. Encouraged by the extensive use of mobile devices, recent intelligent
Computer-Assisted Language Learning (iCALL) applications and platforms
offer a large variety of learning games that open up challenging possibilities (see
Cornillie, Thorne, & Desmet, 2012) compared to more traditional exercises
which emphasize repetition (explicit learning 5). Recent educational tools are
based on modern pedagogical criteria, offering among other things
hyperlinks to electronic dictionaries or concordancers. The information a student
can find is related to word forms (morphology), word meanings (semantics, usage)
and word patterns (syntax, collocations). These electronic resources may even offer
information concerning the origins of the word, its particular usage (constructions),
and typically related words (semantically or thematically, word families).

However, very few tools provide information about the complexity of a word, 
either for learning or for comprehension (showing for example that ‘monster’ is 
a simpler term than its hyponyms ‘phoenix’ or ‘behemoth’, or that ‘to walk’ is 
easier than its synonyms ‘to stroll’ or ‘to ramble’). Yet, the idea of using frequency 
counts as a proxy for word difficulty is not new: frequency word lists were built 
in the past, see for instance Thorndike (1921) and Gougenheim (1958). 

The principle behind these lists is that the frequency of a word affects its 
recognition, thus its acquisition. Based on that principle, which basically relies 
on corpus-based features, some resources have been built where words are 
classified across difficulty levels: the English Profile Wordlists (Capel, 2010), 
or for French, Manulex (Lété, Sprenger-Charolles, & Colé, 2004) and FLELex
(François, Gala, Watrin, & Fairon, 2014).

In this paper we introduce a new graded lexical resource for French synonyms, 
ReSyf, that relies on a large set of intra-lexical and psycholinguistic features 
to assign a grade level to each synonym. The resource was first introduced in
Gala, François, Bernhard, and Fairon (2013), but the current version has been
enhanced, notably in its ability to discriminate between synsets
of synonyms.

2. Methodology to build ReSyf 

ReSyf 6 is a lexical database with graded word senses containing 31,141 entries,
extracted, filtered and annotated using different resources available for French. A
predictive model based on lexical and psycholinguistic features related to lexical
complexity (Gala et al., 2014) was used to assign grade levels to the entries and
synonyms.


5. Explicit and implicit learning differ in the attention the user pays to the words (Ma & Kelly, 2006): explicit
learning relies on exercises specifically focused on vocabulary, whereas implicit learning occurs as a side effect
of activities in which the student is repeatedly exposed to words, as in reading.

6. Retrieved from http://resyf.lif.univ-mrs.fr

2.1. From words to word-senses 

The entries were extracted from the list of concepts 7 in French from BabelNet 2.5.1
(Navigli & Ponzetto, 2012). As defined by Navigli and Ponzetto (2012), a “concept
in [BabelNet] is represented as a synonym set (called synset)[: a] set of words that
share the same meaning. For instance, the concept of play as a dramatic work is
expressed by the following synset” {drame, jeu dramatique, pièce (théâtre), pièce de
théâtre, texte dramatique, œuvre dramatique} (p. 218). Each concept was extracted
with a list of associated weighted synsets. The weight of a synset corresponds to its
semantic connections in BabelNet.

The list obtained at this stage was filtered with two French reference resources:
Lexique 3 (New, Pallier, Ferrand, & Matos, 2001) and the Trésor de la Langue
Française informatisé (TLFi) (Dendien & Pierrel, 2003). While BabelNet provided
a large amount of data, the reference resources were used to validate the
lexical items and to remove wrong items (words in languages other than French,
orthographic errors, rare or domain-specific terms, etc.). The monosemic words
without information from BabelNet were enriched with synonyms extracted from
JeuxDeMots (Lafourcade, 2007).

All the words obtained were tagged with TreeTagger 8 (Schmid, 1994). As a result
of the Part Of Speech (POS) tagging, we removed wrong lemmas and verified that
all the synonyms of the synset had the same POS tag as the target word. When the
concept appeared as a multiword expression or collocation, we kept the POS tag
of the first item (e.g. noun for texte dramatique). Finally, when a concept appeared
with a hypernym (e.g. pièce (théâtre)), we verified that the hypernym was included
in the domain list of JeuxDeMots 9 and we kept the hypernym as a synonym. The
results obtained are reported in Table 1, while Table 2 presents the number of final
words containing at least one synonym.
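
The validation and POS-consistency checks described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `reference_lexicon` is a hypothetical stand-in for the lemmas attested in Lexique 3 and the TLFi, `pos_lookup` is a hypothetical stand-in for TreeTagger output, and the function names are ours, not part of any of these resources.

```python
def pos_of(lemma, pos_lookup):
    """Return the POS tag of a lemma.

    For a multiword expression, the POS tag of the first item is kept,
    as described in the text. `pos_lookup` is a toy stand-in for a tagger.
    """
    first_item = lemma.split()[0]
    return pos_lookup.get(first_item)


def filter_synset(target, synonyms, reference_lexicon, pos_lookup):
    """Keep synonyms that are validated and share the target's POS tag."""
    target_pos = pos_of(target, pos_lookup)
    kept = []
    for lemma in synonyms:
        # Drop items not attested in the reference resources
        # (foreign words, spelling errors, rare terms, ...).
        if lemma not in reference_lexicon:
            continue
        # Drop synonyms whose POS tag differs from the target word's.
        if pos_of(lemma, pos_lookup) != target_pos:
            continue
        kept.append(lemma)
    return kept


# Toy data, invented for illustration only.
reference_lexicon = {"jeu", "drame", "texte dramatique", "jouer"}
pos_lookup = {"jeu": "N", "drame": "N", "texte": "N", "jouer": "V", "play": "N"}

candidates = ["drame", "texte dramatique", "play", "jouer"]
filtered = filter_synset("jeu", candidates, reference_lexicon, pos_lookup)
print(filtered)
```

In this toy run, the English word is discarded because it is absent from the reference lexicon, and the verb is discarded because its POS tag differs from that of the target noun.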


7. We are interested in selecting concepts without taking named entities into account. We
consider a concept to be a sense in the BabelNet lexical network (Navigli & Ponzetto, 2012).

8. TreeTagger, morphosyntactic annotation tool, retrieved from
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

9. There are 3,086 different domains in JeuxDeMots, version 05/2015.

Table 1. Number of target words in ReSyf filtered with Lexique 3 and TLFi

POS           One sense   Polysemic    Total
Nouns            15,482      16,825   32,307
Verbs             1,003       1,890    2,893
Adjectives        2,620       2,032    4,652
Adverbs             451         595    1,046
Total            19,556      21,342   40,898


Table 2. Number of target words in ReSyf for which each sense contains at least
one synonym

Nouns    Verbs   Adjectives   Adverbs    Total
24,599   2,612        3,094       836   31,141


2.2. Grading word senses with a level of difficulty

The difficulty of the entries and the synonyms in ReSyf was established as the
result of a twofold process. First, we gathered a “gold standard” list of about
19,000 frequent French words annotated with difficulty level information. This resource
was obtained from Manulex by Gala et al. (2014), who transformed the frequency
distribution across the three levels in Manulex into a single grade level.

In a second step, we trained a statistical model on this gold standard, using 49
intra-lexical and psycholinguistic features. The set of features is further detailed in
Gala et al. (2014) and includes the number of letters, the number of phonemes,
the number of syllables, the presence of specific spelling patterns, the syllabic
structure, morphemic information, the number of meanings, the frequency of the
word in Lexique 3, etc. These features were combined within a linear Support
Vector Machine (SVM) model with L2 regularization 10. The final model reached
an accuracy of 63%, which was 2% better than the baseline relying only on
frequency as a predictor, but was still not a satisfactory result for our purpose. As
a consequence, in building ReSyf, our model was used only to assign a grade
level to words absent from Manulex. Otherwise, the level assigned was taken
directly from this resource.


10. The only metaparameter of this model is the cost, which was set to 0.5 as the result of a grid search between values
of 100 and 0.001.
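
The grade assignment logic (use the Manulex-derived level when available, fall back on the model otherwise) can be sketched as below. This is a hedged illustration, not the actual system: the three surface features are crude approximations of a small part of the 49 features (vowel groups are only a rough proxy for syllables), `toy_predict` is a hypothetical placeholder for the trained SVM, and all names and toy data are assumptions introduced here.

```python
import re

# Vowel letters used for the rough syllable proxy (an assumption of this sketch).
VOWELS = "aeiouyàâäéèêëîïôöùûüœ"


def extract_features(word):
    """Compute a few surface features of a word (illustrative subset only)."""
    w = word.lower()
    return {
        "n_letters": sum(ch.isalpha() for ch in w),
        # Number of maximal vowel sequences, a crude stand-in for syllables.
        "n_vowel_groups": len(re.findall(f"[{VOWELS}]+", w)),
        # One example of a "specific spelling pattern": a doubled consonant.
        "has_double_consonant": bool(re.search(rf"([^\W\d_{VOWELS}])\1", w)),
    }


def toy_predict(features):
    """Hypothetical placeholder for the trained model: grade by word length."""
    n = features["n_letters"]
    if n <= 5:
        return 1
    if n <= 8:
        return 2
    return 3


def assign_grade(word, manulex_grades, predict):
    """Return (grade, source): the gold-standard grade if known, else a prediction."""
    if word in manulex_grades:
        return manulex_grades[word], "manulex"
    return predict(extract_features(word)), "model"


# Toy gold standard, invented for illustration.
manulex_grades = {"jeu": 1, "drame": 2}

print(assign_grade("jeu", manulex_grades, toy_predict))
print(assign_grade("ritournelle", manulex_grades, toy_predict))
```

Keeping the gold-standard lookup ahead of the prediction mirrors the choice reported above: the model is only consulted for words that the reference resource does not cover.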

Table 3 presents an example of an entry: the word jeu (game) with its POS tag and
its grade. Each list of graded synonyms also carries a weight reflecting the relevance
of the sense.

Table 3. Example of a ReSyf entry with different lists of synonyms

Target word   Graded synonyms
jeu_N_1       [partie_N_1, catch_N_3]+1425
              [jeu de hasard_N_1, pari_N_1]+918
              [drame_N_2, jeu dramatique_N_2, pièce (théâtre)_N_1, pièce de théâtre_N_1,
               texte dramatique_N_2, œuvre dramatique_N_3]+392
              [casse-tête_N_3, puzzle_N_1]+392
              [blague_N_1, bouffonnerie_N_1, farce_N_1, plaisanterie_N_1, tour_N_1]+345
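
An entry of this kind can be read programmatically. The sketch below assumes, purely for illustration, that each item is serialized as lemma_POS_grade and each sense as a bracketed list followed by +weight, as displayed in Table 3; the actual distribution format of ReSyf may differ, and the function and class names are ours.

```python
import re
from typing import NamedTuple


class GradedWord(NamedTuple):
    lemma: str
    pos: str
    grade: int


def parse_graded_word(token):
    """Parse an item such as 'jeu de hasard_N_1' into (lemma, POS, grade)."""
    # Split from the right so that lemmas containing spaces,
    # hyphens or parentheses are left intact.
    lemma, pos, grade = token.strip().rsplit("_", 2)
    return GradedWord(lemma, pos, int(grade))


# A bracketed, comma-separated list of items followed by '+weight'.
SENSE_RE = re.compile(r"\[(?P<words>[^\]]*)\]\s*\+(?P<weight>\d+)")


def parse_sense(line):
    """Parse one sense, returning (list of GradedWord, weight)."""
    match = SENSE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sense format: {line!r}")
    words = [parse_graded_word(t) for t in match.group("words").split(",")]
    return words, int(match.group("weight"))


target = parse_graded_word("jeu_N_1")
words, weight = parse_sense("[casse-tête_N_3, puzzle_N_1]+392")

# The lowest-graded synonym of this sense, i.e. the easiest one.
easiest = min(words, key=lambda w: w.grade)
print(target, weight, easiest.lemma)
```

Once parsed, the synonyms of a sense can be sorted or filtered by grade, which is the basis for both graded vocabulary exercises and synonym substitution.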


3. Conclusion 

In this paper, we have presented a graded lexicon for French synonyms in which
each word is assigned a level of complexity calculated from frequency counts,
intra-lexical and psycholinguistic features. While aimed at text simplification, this graded
lexicon can also help learners of French to acquire vocabulary. On the one hand,
the lexicon itself can be used for explicit
learning of French vocabulary guided by the different grades of the synonyms of
a word. On the other hand, it can be used to carry out word substitution within an
automatic text simplification system that helps learners and children with
reading impairments to get through a text and rediscover the pleasure of reading
(as they can better understand what they read), thus entering a virtuous circle
whereby reading and decoding skills are trained through reading practice.
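
As a rough illustration of the word substitution scenario, the sketch below replaces a word by its lowest-graded synonym when that synonym is strictly easier. It is a deliberately naive sketch under stated assumptions: `TOY_LEXICON` is invented data loosely modelled on Table 3, single-word matching is used, and sense disambiguation, inflection and agreement, which a real simplification system would need, are ignored.

```python
import re

# Toy lexicon: word -> (grade, [(synonym, grade), ...]). Invented for illustration.
TOY_LEXICON = {
    "casse-tête": (3, [("puzzle", 1), ("jeu", 1)]),
    "catch": (3, [("partie", 1)]),
}


def simplify(text, lexicon):
    """Replace each known word by its easiest synonym, if strictly easier."""

    def replace(match):
        word = match.group(0)
        entry = lexicon.get(word.lower())
        if entry is None or not entry[1]:
            return word  # unknown word or no synonym: leave unchanged
        grade, synonyms = entry
        best, best_grade = min(synonyms, key=lambda s: s[1])
        return best if best_grade < grade else word

    # Words are sequences of letters, digits and hyphens (e.g. 'casse-tête').
    return re.sub(r"[\w-]+", replace, text)


simplified = simplify("Ce casse-tête est amusant.", TOY_LEXICON)
print(simplified)
```

Requiring the substitute to be strictly easier avoids swapping a word for a synonym of the same grade, which would change the text without simplifying it.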